[SEDONA-714] Add geopandas to spark arrow conversion. #1825

Imbruced · 2025-02-23T20:54:21Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Development Guide
No, I haven't read it.

Is this PR related to a JIRA ticket?

Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-XXX. The PR name follows the format [SEDONA-XXX] my subject.
No:
- this is a documentation update. The PR name follows the format [DOCS] my subject
- this is a CI update. The PR name follows the format [CI] my subject

What changes were proposed in this PR?

How was this patch tested?

Did this PR include necessary documentation updates?

Yes, I am adding a new API. I am using the current SNAPSHOT version number in vX.Y.Z format.
Yes, I have updated the documentation.
No, this PR does not affect any public API so no need to change the documentation.

jiayuasu · 2025-02-24T02:32:38Z

@paleolimbot

Imbruced · 2025-02-24T08:45:02Z

I'll fix the missing function issue.

Imbruced · 2025-02-24T09:21:57Z

Starting from Spark 4.0, we can pass the whole arrow table to Spark.createDataFrame. I don't know when the release will be.

paleolimbot

This is awesome! I'm new to this code base, so consider my comments optional nits 🙂

Starting from Spark 4.0, we can pass the whole arrow table to Spark.createDataFrame

Based on this PR, I'm happy to attempt backporting GeoArrow import of anything implementing __arrow_c_stream__ as a follow-up, which would avoid materializing the GeoPandas data frame 🙂
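
The __arrow_c_stream__ method mentioned above is part of the Arrow PyCapsule Interface: an object that exposes it can hand out its data as an Arrow stream without the consumer knowing its concrete type. A minimal sketch of how an importer might detect such objects follows. The names supports_arrow_stream and FakeStreamExporter are hypothetical and used only for illustration; nothing here is taken from the Sedona code.

```python
from typing import Any


def supports_arrow_stream(obj: Any) -> bool:
    """Return True if obj advertises the Arrow PyCapsule stream protocol.

    Hypothetical helper: it only checks that a callable __arrow_c_stream__
    attribute exists. It does not call it or validate the returned capsule.
    """
    return callable(getattr(obj, "__arrow_c_stream__", None))


class FakeStreamExporter:
    """Stand-in for a real exporter (illustration only)."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real implementation returns a PyCapsule named "arrow_array_stream".
        raise NotImplementedError("illustration only")


print(supports_arrow_stream(FakeStreamExporter()))  # True
print(supports_arrow_stream([1, 2, 3]))  # False
```

A check like this would let one code path accept any stream-capable object rather than special-casing GeoPandas.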

paleolimbot · 2025-02-24T09:49:58Z

python/sedona/utils/geoarrow.py

+from pyspark.sql import SparkSession
+from pyspark.sql import DataFrame
+from pyspark.sql.types import StructType, StructField, DataType, ArrayType, MapType
+import pyarrow as pa


I am not sure what the dependency situation is like for Spark, but it may be worth making this a lazy import (e.g., like in dataframe_to_arrow) so that when we import from sedona.utils.geoarrow in sedona/spark/__init__.py we don't necessarily require pyarrow to be installed. Alternatively, we could add pyarrow to the apache-sedona[spark] extras to match the runtime requirement.

Good idea. I'll make all the changes later today. Thank you for the review!
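
The lazy-import suggestion in this thread can be sketched as follows. This is an illustration under stated assumptions, not the code that was merged: lazy_import and table_num_rows are hypothetical names, and the install hint is only an example message.

```python
import importlib
from types import ModuleType


def lazy_import(module_name: str, extra_hint: str = "") -> ModuleType:
    """Import module_name on first use instead of at package import time."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        hint = f" ({extra_hint})" if extra_hint else ""
        raise ImportError(
            f"{module_name} is required for this feature{hint}"
        ) from exc


def table_num_rows(columns: dict) -> int:
    """Hypothetical function that needs pyarrow only when it is called."""
    pa = lazy_import("pyarrow", "try: pip install pyarrow")
    return pa.table(columns).num_rows
```

With this layout, importing the module succeeds even when pyarrow is missing; the ImportError surfaces only when table_num_rows is actually called.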

paleolimbot · 2025-02-24T09:53:16Z

python/sedona/utils/geoarrow.py

+        return [gen_new_name[name]() for name in names]
+
+
+def _deduplicate_field_names(dt: DataType) -> DataType:


Suggested change

-def _deduplicate_field_names(dt: DataType) -> DataType:
+# Backport from Spark 4.0
+# https://github.com/apache/spark/blob/3515b207c41d78194d11933cd04bddc21f8418dd/python/pyspark/sql/pandas/types.py#L1385
+def _deduplicate_field_names(dt: DataType) -> DataType:
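
For context, the `return [gen_new_name[name]() for name in names]` line in the diff above belongs to a helper that renames repeated field names. The following is a minimal sketch of that idea, written from the visible line and not copied from the Spark source; the function name dedup_names and the `_0`, `_1` suffix scheme are assumptions made for illustration.

```python
import itertools
from typing import Callable, Dict, List


def dedup_names(names: List[str]) -> List[str]:
    """Give repeated names a numeric suffix; leave unique names untouched."""
    if len(set(names)) == len(names):
        return list(names)

    def numbered(name: str) -> Callable[[], str]:
        # Each call yields name_0, name_1, ... in order of appearance.
        counter = itertools.count()
        return lambda: f"{name}_{next(counter)}"

    def unchanged(name: str) -> Callable[[], str]:
        return lambda: name

    # groupby needs sorted input so that equal names are adjacent.
    gen_new_name: Dict[str, Callable[[], str]] = {
        name: numbered(name) if len(list(group)) > 1 else unchanged(name)
        for name, group in itertools.groupby(sorted(names))
    }
    return [gen_new_name[name]() for name in names]


print(dedup_names(["id", "geom", "id"]))  # ['id_0', 'geom', 'id_1']
```

This sketch does not guard against a generated name such as id_0 colliding with a name that already exists in the input.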

Imbruced · 2025-02-24T16:01:44Z

This is awesome! I'm new to this code base, so consider my comments optional nits 🙂

Starting from Spark 4.0, we can pass the whole arrow table to Spark.createDataFrame

Based on this PR, I'm happy to attempt backporting GeoArrow import of anything implementing __arrow_c_stream__ as a follow-up, which would avoid materializing the GeoPandas data frame 🙂

@paleolimbot
That would be great. When writing Chapter 6, I included examples with the code you provided, which significantly improves the conversion time when we create a GeoPandas data frame from Sedona. Something similar for the opposite direction, from GeoPandas to Sedona, would be great, so I opened this MR. I am happy to apply all the changes you mentioned.

jiayuasu

Can you add documentation to this page? https://sedona.apache.org/latest/tutorial/geopandas-shapely/

Imbruced · 2025-02-25T21:34:01Z

Can you add documentation to this page? https://sedona.apache.org/latest/tutorial/geopandas-shapely/

Sure.

paleolimbot

Apologies for the late review...this is awesome! Thank you!

github-actions bot added the sedona-python label Feb 23, 2025

paleolimbot reviewed Feb 24, 2025

jiayuasu changed the title ~~SEDONA-714 Add geopandas to spark arrow conversion.~~ [SEDONA-714] Add geopandas to spark arrow conversion. Feb 24, 2025

Imbruced marked this pull request as ready for review February 25, 2025 10:18

Imbruced requested a review from jiayuasu as a code owner February 25, 2025 10:18

jiayuasu requested changes Feb 25, 2025

Imbruced and others added 8 commits February 25, 2025 22:45

SEDONA-714 Add geopandas to spark arrow conversion.

84b98d3

SEDONA-714 Add geopandas to spark arrow conversion.

11fa62d

SEDONA-714 Add geopandas to spark arrow conversion.

840ed6b

SEDONA-714 Add geopandas to spark arrow conversion.

a78efe3

SEDONA-714 Add geopandas to spark arrow conversion.

03e4769

Update python/sedona/utils/geoarrow.py

7b04a4c

Co-authored-by: Dewey Dunnington <dewey@wherobots.com>

SEDONA-714 Add geopandas to spark arrow conversion.

80b0c8f

SEDONA-714 Add docs.

1c96da0

Imbruced force-pushed the SEDONA-714-add-geopandas-to-spark-arrow-conversion branch from f328661 to 1c96da0 Compare February 25, 2025 22:17

github-actions bot added docs sedona-docker root labels Feb 25, 2025

SEDONA-714 Add docs.

c00ddf7

jiayuasu reviewed Feb 26, 2025

docker/docs/Dockerfile (review thread resolved)

jiayuasu approved these changes Feb 26, 2025

jiayuasu added this to the sedona-1.7.1 milestone Feb 26, 2025

jiayuasu merged commit ebd6f67 into master Feb 26, 2025
28 checks passed

paleolimbot reviewed Feb 26, 2025

jiayuasu deleted the SEDONA-714-add-geopandas-to-spark-arrow-conversion branch February 28, 2025 17:33
